Real-world testing of an artificial intelligence algorithm for the analysis of chest X-rays in primary care settings

Interpreting chest X-rays is a complex task, and artificial intelligence algorithms for this purpose are currently being developed. It is important to perform external validations of these algorithms before they are implemented. This study therefore aims to externally validate an AI algorithm’s diagnoses in real clinical practice, comparing them to a radiologist’s diagnoses. The aim is also to identify diagnoses the algorithm may not have been trained for. A prospective observational study was conducted for the external validation of the AI algorithm in a region of Catalonia, comparing the AI algorithm’s diagnosis with that of the reference radiologist, considered the gold standard. The external validation was performed with a sample of 278 images and reports, 51.8% of which showed no radiological abnormalities according to the radiologist's report. Analysing the validity of the AI algorithm, the average accuracy was 0.95 (95% CI 0.92; 0.98), the sensitivity was 0.48 (95% CI 0.30; 0.66) and the specificity was 0.98 (95% CI 0.97; 0.99). The algorithm was most sensitive for external implants, upper abdominal conditions and cardiac and/or valvular conditions. On the other hand, it was least sensitive for conditions of the mediastinum, vessels and bone. The algorithm has been validated in the primary care setting and has proven to be useful when identifying images with or without conditions. However, in order to be a valuable tool to help and support experts, it requires additional real-world training to enhance its diagnostic capabilities for some of the conditions analysed. Our study emphasizes the need for continuous improvement to ensure the algorithm’s effectiveness in primary care.

The ChestEye AI algorithm
Oxipit 33, one of the leading companies in AI medical image reading, has developed a fully automatic computer-aided diagnosis (CAD) AI algorithm for reading chest X-rays trained with more than 300,000 images, available through a web platform called ChestEye. The ChestEye imaging service has been certified as a Class II medical device on the Australian Register of Therapeutic Goods and has also been CE marked 33. The web platform reads the inserted chest X-ray and returns the automatic report with the ability to detect 75 conditions, which cover 90% of diagnoses, as well as a heat map to show the locations of the findings. Thus, ChestEye allows radiologists to analyse only the most relevant X-rays 33,34.

Study design
A prospective observational study was carried out for the external validation of the AI algorithm in a region of Catalonia, among users who were scheduled for chest radiography at the Osona Primary Care Center. For each user, the report of the reference radiologist (considered the gold standard) was obtained. Subsequently, the research team input the image into the AI algorithm to obtain its diagnosis. This allowed for the comparison of the AI's performance with the reference standard in terms of accuracy, sensitivity, specificity, positive predictive value and negative predictive value.

Description of the study population, time frame, and data collection
The study was carried out at the Catalan Institute of Health's Primary Care Centre Vic Nord (Osona, Catalonia, Spain), a reference centre where all chest X-rays in the region are performed (with a coverage of 125,000 users). At this same centre, convenience recruitment was carried out from 7 February 2022 to 31 May 2022. The study was explained, and the information sheet and informed consent were given to all patients who came for a chest X-ray and met the inclusion criteria 32.
The reference population of the study was the entire population of the Osona region who underwent chest X-rays at the study centre and agreed to participate in the study. The study included only anteroposterior chest X-rays on those over 18 years of age and excluded pregnant women and poor-quality chest X-rays (poor exposure, non-centred or rotated images).

Sample size
Due to a problem with the image collection centre, the sample size calculated in the protocol 32 could not be reached. For this reason, the sample was recalculated by increasing the precision by one percentage point. Thus, in order to validate the algorithm, a sample of 450 images was needed to estimate an overall accuracy expected

Procedure
Once recruitment was completed, the Technical Service of the Catalan Institute of Health of Central Catalonia extracted the patients' anonymous and automated images and their corresponding non-anonymised reports. The images and reports were then coded together so that they could be related.
The research team then entered all the images into the AI model to extract their interpretation (the diagnosis or the possibility of no abnormalities). At the same time, three general practitioners interpreted the reference radiologists' reports, without seeing the images in order to avoid assumptions, with the aim of extracting the diagnoses described.
Finally, the group of general practitioners grouped all the conditions detectable by the AI model into 9 categories according to the anatomy of the thorax in order to build an individual and grouped study. The categories were: external implants, mediastinal findings or conditions, cardiac and valvular conditions, vessel conditions, bone conditions, pleural or pleural space conditions, upper abdominal findings or conditions, pulmonary parenchymal findings or conditions, and others.

Statistical analysis
To validate the algorithm, the AI algorithm's diagnoses were compared with those of the gold standard. The accuracy of the algorithm and the confusion matrix were obtained from the images classified as true positive (TP), true negative (TN), false positive (FP) and false negative (FN). Sensitivity and specificity were also calculated. These measurements were obtained for the total sample, for each condition and for each of the categories according to physiology. Analyses were performed with R software version 4.2.1 and all confidence intervals were 95%.
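The measures described above can be sketched for a single condition as follows. This is a minimal illustration, not the study's analysis code: the study used R 4.2.1 and does not state which confidence interval method was applied, so the Wilson score interval, the function names and the example counts below are all assumptions made for illustration.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """95% Wilson score interval for a proportion (assumed method, not from the study)."""
    if total == 0:
        return (float("nan"), float("nan"))
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def diagnostic_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity, with intervals, from confusion matrix counts."""
    total = tp + tn + fp + fn
    positives = tp + fn  # images with the condition according to the gold standard
    negatives = tn + fp  # images without the condition according to the gold standard
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / positives if positives else float("nan"),
        "specificity": tn / negatives if negatives else float("nan"),
        "accuracy_ci": wilson_interval(tp + tn, total),
        "sensitivity_ci": wilson_interval(tp, positives),
        "specificity_ci": wilson_interval(tn, negatives),
    }


# Hypothetical counts for one condition in a sample of 278 images.
metrics = diagnostic_metrics(tp=8, tn=260, fp=6, fn=4)
for name, value in metrics.items():
    print(name, value)
```

Applying the same function to each condition and to each grouping would give the per-condition and per-group values; with few positive cases the sensitivity interval becomes very wide, which is the behaviour expected for rare conditions.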

Ethics committee
The University Institute for Research in Primary Health Care Jordi Gol i Gurina (Barcelona, Spain) ethics committee approved the trial study protocol (approval code: 21/288). Written informed consent was requested from all patients participating in the study.

Ethical considerations
Radiologists' assessment and decisions were not influenced by this study, as the normal radiology referral workflow was not affected. This project was approved by the Research Ethics Committee (REC) from the Foundation University Institute for Primary Health Care Research Jordi Gol i Gurina (IDIAPJGol) (P21/288-P). The study was performed in accordance with relevant guidelines/regulations, and informed consent was obtained from all participants. All research was performed in accordance with the Declaration of Helsinki.

Results
Of the 471 patients who agreed to participate in the study and provide the images and reports, the final sample for external validation of the model was 278, mainly due to computer-related issues when extracting the data. In some cases, the automatic extraction did not return both the image and the report, i.e., either the image or the report was missing. In addition, some reports did not include the interpretation of the image, as it was a follow-up X-ray. In these cases, the report only indicated whether or not there were changes with respect to the previous report and, therefore, they had to be discarded from the analysis (Fig. 1). Of the final sample, 144 images (51.8%) showed no radiological abnormalities according to the radiologist's report.
Analysing the validity of the AI algorithm, the average accuracy was 0.95 (95% CI 0.92; 0.98), the average sensitivity was 0.48 (95% CI 0.30; 0.66) and the average specificity was 0.98 (95% CI 0.97; 0.99). The accuracy, sensitivity and specificity values for each condition can be seen in Table 2. The values for true positives, true negatives, false positives, and false negatives for each condition are presented in Table 1 of the supplementary information.
The reference radiologist identified a list of conditions for which the algorithm was not trained, and which were therefore classified as "Other". The most prevalent were bronchial wall thickening (n = 13, 4.68%), fibroscarring lesions or abnormalities (n = 11, 3.96%) and chronic pulmonary abnormalities (n = 11, 3.96%) (Table 3).
Figures 2 and 3 show some examples of the AI algorithm's performance in cases where the algorithm's diagnosis was successful and in cases where errors occurred.
In order to perform a more general analysis, the conditions were grouped into 10 groups according to chest anatomy, considering only the diagnoses for which the AI model was trained. According to the radiologist, the most prevalent groupings found were lung parenchymal conditions (n = 71, 25  4). Finally, Table 5 shows the accuracy, sensitivity and specificity values for each group. Of these, it is worth mentioning the low sensitivity values for mediastinal conditions (0.0, 95% CI 0.0; 0.96), vessel conditions (0.29, 95% CI 0.11; 0.52) and bone conditions (0.24, 95% CI 0.08; 0.47). On the other hand, high sensitivity values were recorded for external implants (0.67, 95% CI 0.22; 0.96), upper abdominal conditions (0.67, 95% CI 0.30; 0.93) and cardiac and/or valvular conditions (0.67, 95% CI 0.35; 0.90). It is also worth mentioning the model's strong ability to detect images that do not have radiological abnormalities. The values for true positives, true negatives, false positives, and false negatives for each grouping are presented in Table 1 of the supplementary information.

Discussion
The aim of this study was to perform an external validation, in real clinical practice, of the diagnostic capability of an AI algorithm with respect to the reference radiologist for chest X-rays, as well as to detect possible diagnoses for which the algorithm had not been trained. The overall accuracy of the algorithm was 0.95 (95% CI 0.92-0.98), the sensitivity was 0.48 (95% CI 0.30-0.66) and the specificity was 0.98 (95% CI 0.97-0.99). The results obtained have further highlighted, as indicated by different expert groups 26,28,29, the need for external validations of AI algorithms in a real clinical context in order to establish the necessary measures and adaptations to ensure safety and effectiveness in any environment. Therefore, in the context of the model developed, it is important to understand and interpret what each of the results obtained indicates.
High accuracy values were observed in most cases (ranging between 0.7 and 1). Accuracy is the proportion of correctly classified results among the total number of cases examined. This value was high because, both for each condition and for the groups of conditions, the capacity to detect true negatives was good, taking into account that the majority of the images analysed (51.8%) were found to have no abnormalities. An AI algorithm that quickly determines that there is no abnormality can function as a triage tool, streamlining the diagnostic process, allowing the professional to focus on other tests, and reducing waiting lists, waiting times for diagnoses and even expenses in secondary tests.
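The way true negatives drive accuracy can be illustrated with a short worked example. The counts below are hypothetical and are not taken from the study's tables; they simply assume a condition present in 10 of 278 images.

```python
def accuracy_and_sensitivity(tp, tn, fp, fn):
    """Return (accuracy, sensitivity) from confusion matrix counts."""
    total = tp + tn + fp + fn
    return (tp + tn) / total, tp / (tp + fn)


# Hypothetical rare condition: 10 affected images out of 278.
# Case A: the algorithm finds 3 of the 10 and raises no false alarms.
acc_a, sens_a = accuracy_and_sensitivity(tp=3, tn=268, fp=0, fn=7)

# Case B: a trivial rule that never reports the condition.
acc_b, sens_b = accuracy_and_sensitivity(tp=0, tn=268, fp=0, fn=10)

print(f"A: accuracy={acc_a:.3f}, sensitivity={sens_a:.2f}")
print(f"B: accuracy={acc_b:.3f}, sensitivity={sens_b:.2f}")
```

Under these assumed counts, accuracy stays above 0.96 in both cases even though sensitivity is 0.30 in the first and 0 in the second, which is why accuracy needs to be read alongside sensitivity for low-prevalence conditions.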
With sensitivity referring to the ability to detect an abnormality when there really is one, high sensitivity values were shown when detecting anatomical findings or abnormalities such as sternal cables, enlarged heart, abnormal ribs, spinal implants, cardiac valve, or interstitial markings. On the other hand, low sensitivity values were observed for most conditions, indicating that the algorithm had limited ability to detect certain conditions like those in the mediastinum, vessels, or bones. These findings align with the results of a study that performed an external validation of a similar algorithm in an emergency department 35. Additionally, the algorithm exhibited low sensitivity in detecting pulmonary emphysema, linear atelectasis, and hilar prominence, which are prevalent conditions in the primary care setting 31.
Low sensitivity was also observed when detecting nodules, with the algorithm finding more nodules than the reference radiologist, in most cases confusing them with areolae in the breast tissue. Although it is important to be able to detect any warning signs, and the professional remains in charge of making the clinical judgement and determining the need for complementary tests, it is possible that this external validation has detected a possible gender bias in the training of the algorithm. When it comes to chest imaging, it is important to distinguish between the physiological aspects of breast tissue and any potential changes it may undergo during various life stages, as opposed to signs of conditions or abnormalities 36. Other studies have also detected a high false positive value in the detection of nodules due to other causes such as fat, pleura or interstitial lung disease 37.
Table 1. Description of the conditions or anatomical abnormalities of the 278 images and their respective diagnoses according to the radiologist and the AI algorithm.
COPD, and fibrocystic abnormalities. Furthermore, it was noted that certain condition names within the AI algorithm should be adjusted to align with names used in the radiology field. Interstitial markings could be changed to interstitial abnormality, consolidation to condensation, aortic sclerosis to valvular sclerosis, and abnormal rib to rib fracture.
Having discussed the main variables that characterise the algorithm's capacity, it should be noted that the results obtained differ from those of most published studies, the majority of which reported a higher algorithm capacity. However, most of these are internal validations and were not tested in real clinical practice settings 38-40.
A study in Korea performed an internal and external validation of an AI algorithm capable of detecting the 10 most prevalent chest X-ray abnormalities and was able to demonstrate the difference in sensitivity and specificity values. The internal validation obtained sensitivity and specificity values between 0.87-0.94 and 0.81-0.98, respectively. On the other hand, the external validation obtained sensitivity and specificity values between 0.61-1.00 and 0.71-0.98, respectively 41. This difference can also be seen in a study in Michigan, where internal and external validation of an AI algorithm capable of detecting the most common chest X-ray abnormalities was performed 42, and in a study at the Seoul University School of Medicine, where an algorithm for lung cancer detection in population screening was validated 43.
Therefore, the results obtained from the external validation show the need to increase the sensitivity of the algorithm for most conditions. Considering that AI should serve as a diagnostic support tool and the ultimate responsibility for medical decisions rests with the practitioner, it is ideal for the algorithm to flag potential abnormalities for the practitioner to review and confirm. This ensures the highest diagnostic effectiveness. Recent studies have shown that the use of an AI algorithm to support the practitioner significantly improves diagnostic sensitivity and specificity and reduces image reading time 20,44.
Enhanced sensitivity could help address the shortage of specialised radiologists globally, especially in Central Catalonia's primary care setting, where this validation was conducted 45,46. Increasingly, general practitioners are tasked with interpreting X-rays. In this context, the advancement of these tools can be a valuable asset in the diagnostic process.

Limitations and strengths
One significant limitation of the study was the small sample size for certain specific conditions. This was due to difficulties in obtaining the required number of cases, as these conditions are not very common in real clinical practice. Consequently, the external validation for these conditions yielded less reliable estimates. However, by representing reality, a large volume of images without radiological abnormalities was obtained and this allowed for a good external validation of the model's ability to detect images without abnormalities. In addition, the reference diagnosis was not always made by the same radiologist, but by a group of radiologists. This could represent a limitation, since there was no consensus among them, but there was no desire to alter actual clinical practice. In addition, the study aimed to test the algorithm in primary care settings. For this reason, a double interpretation was performed: the images were initially interpreted by the radiologist and, subsequently, the radiologist's report was interpreted by the family and community physician. Finally, another limitation was the lack of information on the sex of the users analysed. Given the results obtained, we consider it very relevant to carry out a further study that analyses the capabilities of the algorithm separately by sex, since it seems that they might not be the same. However, since the sample is small for most of the conditions, separating the analyses by sex in the present study would be unreliable and not comparable.
On the other hand, the greatest strength of the study is that it presents an external validation in real clinical practice in primary care and there are currently few studies that have done so.Most studies present an internal validation, but it is very important to perform an external validation in order to estimate the accuracy of the model in a population other than the training population, thus allowing the results to be generalised.

Conclusion
The findings of this study demonstrate the validation of an AI algorithm for reading chest X-rays in the primary care setting, achieved by comparing its diagnoses with those made by a radiologist. The algorithm has been validated in the primary care setting using values such as the accuracy, sensitivity and specificity of the algorithm and has proven to be useful by being able to identify images with or without abnormalities. However, further training is needed to increase its diagnostic capability for some of the conditions analysed. It is important that training is done in a real environment, with real images, in order to perform robust external validations. Our analysis highlights the need for continuous improvement to ensure that the algorithm is a reliable and effective tool in the primary care environment.
The role of AI in healthcare should be to assist and support practitioners. Being able to reliably detect images without abnormalities can have a very positive impact, reducing waiting times for diagnoses and secondary tests to rule out conditions, streamlining practitioners' work and, ultimately, favouring patient care and, indirectly, patients' health.

Figure 1. Flow chart of the final sample of study images.

Figure 2. Image of patient (upper-left) where according to the radiologist's report there is only consolidation, but the algorithm detects an abnormal rib (upper-right), consolidation (lower-left) and two nodules (lower-right). It is worth noting the confusion of a consolidation with mammary tissue and of two nodules with the two mammary areolae.

Figure 3. Image of patient (left) where the AI algorithm and the radiologist detected the same condition: consolidation (right).

Table 3. Description of conditions not contemplated by the AI model.

Table 4. Description of the conditions of the 278 images according to the radiologist and AI model, grouped according to chest anatomy.

Table 5. Accuracy, sensitivity and specificity values for each grouping.